Temporal Selective Max Pooling Towards Practical Face Recognition
Abstract
In this report, we address two challenges in building a real-world face recognition system: pose variation in uncontrolled environments and the computational expense of processing a video stream. First, we argue that the frame-wise feature mean cannot characterize the variation among frames. We propose preserving the overall pose diversity so that the video feature represents the subject's identity. Identity will then be the only source of variation across videos, since pose varies even within a single video. Following this idea of untangling variation, we present a pose-robust face verification algorithm in which each video is represented as a bag of frame-wise CNN features. Second, instead of simply using all the frames, we emphasize key-frame selection. This is achieved by pose quantization using pose distances to K-means centroids, which reduces the number of feature vectors from hundreds to K while still preserving the overall diversity. Recognition is implemented with a rank list of one-to-one similarities (i.e., verification) using the proposed video representation. On the official 5000 video pairs of the YouTube Face dataset, our algorithm achieves performance comparable to the state of the art, which averages deep features over all frames. Notably, although the proposed generic algorithm is validated on a public dataset, it remains applicable to real-world systems.
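The key-frame selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans` and `select_key_frames`, the representation of pose as a small per-frame vector (for example yaw, pitch, roll), the Euclidean pose distance, and the rule of keeping the single frame nearest to each centroid are all assumptions made for illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means on an (n, d) float array; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


def select_key_frames(poses, features, k):
    """Quantize poses into k clusters and keep the frame nearest each centroid.

    poses:    (n_frames, pose_dim) array, hypothetical per-frame pose estimates
    features: (n_frames, feat_dim) array, hypothetical frame-wise CNN features
    Returns the selected frame indices and the corresponding bag of features.
    Fewer than k frames are returned if two centroids share a nearest frame.
    """
    centroids = kmeans(poses, k)
    dists = np.linalg.norm(poses[:, None, :] - centroids[None, :, :], axis=2)
    key_idx = np.unique(dists.argmin(axis=0))  # nearest frame per centroid
    return key_idx, features[key_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    poses = rng.normal(size=(300, 3))      # synthetic poses for 300 frames
    features = rng.normal(size=(300, 64))  # synthetic frame features
    idx, bag = select_key_frames(poses, features, k=9)
    print(idx.shape, bag.shape)
```

Under these assumptions, a video with hundreds of frames is reduced to at most K feature vectors that still span the poses seen in the video; how two such bags are compared for verification is left open here, since the abstract does not specify it.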
Similar resources
Emergence of Selective Invariance in Hierarchical Feed Forward Networks
Many theories have emerged which investigate how invariance is generated in hierarchical networks through simple schemes such as max and mean pooling. The restriction to max/mean pooling in theoretical and empirical studies has diverted attention away from a more general way of generating invariance to nuisance transformations. In this exploratory study, we study the conjecture that hierarchica...
Learning Robust Deep Face Representation
With the development of convolutional neural networks, more and more researchers focus on the advantages of CNNs for the face recognition task. In this paper, we propose a deep convolutional network for learning a robust face representation. The network consists of 4 convolution layers, 4 max pooling layers and 2 fully connected layers, and contains about 4M pa...
Action Representation Using Classifier Decision Boundaries
Most popular deep learning based models for action recognition are designed to generate separate predictions within their short temporal windows, which are often aggregated by heuristic means to assign an action label to the full video segment. Given that not all frames from a video characterize the underlying action, pooling schemes that impose equal importance to all frames might be unfavorab...
Successful Decoding of Famous Faces in the Fusiform Face Area
What are the neural mechanisms of face recognition? It is believed that the network of face-selective areas, which spans the occipital, temporal, and frontal cortices, is important in face recognition. A number of previous studies indeed reported that face identity could be discriminated based on patterns of multivoxel activity in the fusiform face area and the anterior temporal lobe. However, ...
Second-order Temporal Pooling for Action Recognition
Most successful deep learning models for action recognition generate predictions for short video clips, which are later aggregated into a longer time-frame action descriptor by computing a statistic over these predictions. Zeroth-order (max) or first-order (average) statistics are commonly used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel e...
Journal title:
- CoRR
Volume: abs/1609.07042
Pages: -
Publication date: 2016